Teachers : Prof. M.-O. Boldi - Prof. M. Baumgartner Fall Semester 2022
We decided to analyze a TV series for this text mining project, namely one of the most successful comedies of recent years: The Big Bang Theory. We looked at the scripts of the 10 seasons of this series to carry out a detailed analysis and try to understand, without having seen the series, the general framework that emerges. Our objective is to produce a detailed report based on an original database.
Data set web-scraped from https://bigbangtrans.wordpress.com/.
Our goal is to use the relevant text mining and machine learning tools, with both supervised and unsupervised learning methods, to characterize our data frame. In our case, this means understanding the framing of the TV show through sentiment analysis, vocabulary richness, and topic analysis.
The data we use in this project comes from the website “Big Bang Theory Transcripts”. It is accessible at the following link: https://bigbangtrans.wordpress.com/
We decided to web-scrape the data and create csv files to store it. Indeed, it is easier for us to have the data locally in our files so that whenever we want to work with it again, we do not need to scrape it again: we can directly use the files we created.
We created several different files because we want to run several analyses.
The first csv file is called “series_scripts.csv”. It is available in the “data” folder of our project. This data set contains 231 rows and 3 columns:
We do not show what the data look like here, simply because the scripts column contains a lot of text and it would take far too much space in the report. You are invited to open the csv files if you want an overview.
We created a second csv file named “season_scripts.csv”. Indeed, we quickly realized that an analysis per episode would soon become tedious and less meaningful. Therefore, we decided to aggregate the episodes by season. This way, all the scripts of a season’s episodes end up in one concatenated string.
This data set contains 10 rows, one per season, and 2 columns:
Since the table is quite long and even a single row is very long to output, we did not add the data frame output as an annex. We recommend opening the csv file directly if an overview of the data set is needed.
We created 10 other files based on the web scraping, one file per season, because we wanted a new row each time a character speaks.
This means that each file has a different length depending on how many times the speaker changes. However, each file has the following column structure:
Then we combined all these rows into one main file named ‘character_speech.csv’. The first two rows are printed below so you can have an overview of this dataset.
| | season | main_character_script | character_name | character_scripts |
|---|---|---|---|---|
| 3 | 1 | Sheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits. | Sheldon | So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits. |
| 4 | 1 | Leonard: Agreed, what’s your point? | Leonard | Agreed, what’s your point? |
In this part, we perform an Exploratory Data Analysis: we clean the data and derive some first results from our data sets.
As we know that we want to conduct a sentiment analysis on the seasons of the series, we first define the corpus for our analysis. The corpus will be the agg_scripts column of our season_scripts data set, as it contains all the texts to be analysed. Once defined, we clean the texts by removing the numbers, punctuation, symbols, separators and English stop words.
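The cleaning step described above can be sketched in a few lines. The report itself does this in R with quanteda; the following is only an illustrative Python sketch, with a deliberately tiny, hypothetical stop-word list.

```python
import re

# Illustrative sketch (the report uses R/quanteda): lowercase a raw script,
# strip numbers and punctuation, and drop stop words.
# STOP_WORDS is a toy subset, not a real English stop-word list.
STOP_WORDS = {"the", "a", "an", "it", "is", "to", "and", "of", "in", "so", "if"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Tokenize a raw script and remove numbers, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    tokens = re.findall(r"[a-z']+", text)    # keep word characters only
    return [t for t in tokens if t not in stop_words]

sample = "So if a photon is directed through a plane with 2 slits in it..."
print(clean_tokens(sample))
```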
In our scripts, we can see the main characters: Sheldon, Leonard, Penny, Howard and Raj. They appear in every season’s episodes. Therefore, we remove their names for the following exploratory data analysis, since they would otherwise bias our results.
We created a variable ‘characters_names’ grouping the names of all the recurring characters in the series. Logically, these are the words that come up most often, and that is not the purpose of our analysis here. The same goes for the words grouped in the variable ‘words_to_remove’, which correspond to stage directions or do not bring any added value.
Here, we also perform lemmatisation, which removes inflectional endings and keeps only the base, dictionary form of each word: its lemma.
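Lemmatisation amounts to a dictionary lookup from an inflected form to its base form. The report uses a full lemma dictionary in R; the sketch below uses a tiny hand-made lookup table, so the entries are purely illustrative.

```python
# Minimal lemmatisation sketch with a hypothetical, hand-made lookup table.
# A real analysis would use a full lemma dictionary.
LEMMA_DICT = {"slits": "slit", "observed": "observe", "goes": "go",
              "going": "go", "guys": "guy", "feelings": "feeling"}

def lemmatize(tokens, lemma_dict=LEMMA_DICT):
    """Replace each token by its dictionary base form (lemma) when known."""
    return [lemma_dict.get(t, t) for t in tokens]

print(lemmatize(["observed", "slits", "photon"]))
```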
We use the TF-IDF method to look at the specificity of terms across seasons. It weights each term by its frequency in a season relative to how many seasons contain it, thereby reflecting the relevance of terms in the corpus. With the following graph, we can see that the most frequent term is by far ‘time’. Indeed, the frequency matrix shows that ‘time’ appears more than 900 times in the whole show.
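The TF-IDF weighting can be sketched as follows, with the seasons as documents. This is a toy Python illustration of the formula (term frequency times log of inverse document frequency), not the R/quanteda computation used in the report; the mini-corpus is invented.

```python
import math
from collections import Counter

# Toy corpus: each "document" is one season's token list (hypothetical data).
seasons = {
    "s1": ["time", "guy", "gablehouser", "time"],
    "s2": ["time", "love", "guy"],
    "s3": ["time", "feel", "love"],
}

def tf_idf(docs):
    """tf-idf score per term per document: tf * log(N / df)."""
    n = len(docs)
    df = Counter(t for toks in docs.values() for t in set(toks))
    scores = {}
    for doc, toks in docs.items():
        tf = Counter(toks)
        scores[doc] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return scores

scores = tf_idf(seasons)
# 'time' occurs in every season, so its idf (and tf-idf) is zero,
# while 'gablehouser' occurs only in s1 and gets a high weight.
print(scores["s1"])
```

This is exactly why ‘time’, despite being the most frequent term overall, is not *specific* to any season.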
We observe that the most frequent terms belong to the lexical field of feelings, for instance ‘love’, ‘feel’ and ‘fine’. From this first representation alone, we can already suspect a positive pattern in the show.
The bigger the word, the more frequently it appears. The most frequent words seem to be “time”, “guy”, “love”, “feel”, etc. Does this mean that the seasons carry mostly positive sentiments? We will analyse that throughout our report.
Then, we plotted the 10 most frequent words per season and observed that it did not bring much information, so we do not show that graph here. It is not really relevant because ‘time’ and ‘guy’ are predominant throughout the corpus, and for each season the plot told us the same thing; it is logical to see them appear individually.
Again, here we plot the 10 most specific words per season. In season 1, the term ‘gablehouser’ is very specific, while in season 4 the most specific term is ‘sheldon_bot’; we could imagine they were trying to build a robot modelled on Sheldon. If we look at season 10, the verb ‘born’ is very specific to that season, so we can guess that an important event happened there.
This plot is hard to read because there are many terms, but we do see that “time” and “guy” are indeed very frequent, as seen previously, and that “gablehouser” and “sheldon-bot” are each very specific to one particular season. For “gablehouser”, after some research, it is season 1: in the show it is actually the character Dr. Eric Gablehauser, who does not appear after the first two seasons, which is why we had not noticed him earlier and removed him along with the other characters’ names.
We compute the lexical diversity of the scripts and see that the seasons, especially the later ones (8, 10, 7, 9), do not have a very diverse vocabulary. Indeed, their TTR is around 0.25, which means that on average there is only one distinct word type for every four tokens. The season with the richest vocabulary seems to be season 2, with a TTR of almost 0.5.
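The type-token ratio used above is a simple quantity: distinct word types divided by total tokens. A minimal sketch (the report computes it in R):

```python
# Type-token ratio (TTR) sketch: number of distinct word types divided by
# the total number of tokens. A TTR of 0.25 means one type per four tokens.
def ttr(tokens):
    return len(set(tokens)) / len(tokens)

print(ttr(["time", "guy", "time", "time"]))  # 2 types / 4 tokens = 0.5
```

Note that TTR depends on document length, which is one reason to compare documents of similar size.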
We performed a keyness analysis to understand the relative frequency of terms in a target document compared to the rest of the corpus.
In the first graph, we analyse season 5 as the target. The reference is the rest of the corpus, and we observe that the word ‘siri’ is the most over-used in the script of season 5 compared to the rest.
In the second graph, we analyse season 7 as the target. The reference is again the rest of the corpus, and we observe that the word ‘element’ is the most over-used in the script of season 7 compared to the rest.
Next, we decided to create a co-occurrence matrix to get an overview of which words often appear together.
Because it is very difficult to see when we have too many words, we decided to restrict the co-occurrence matrix to terms that co-occur more than 300 times together.
The co-occurrence matrix helps us understand how many times the most frequent words co-occur in the corpus. For example, ‘time’ and ‘guy’ co-occur 138’402 times, which suggests that they are used in the same contexts in the script.
Below is a plot of the co-occurrence graph. Each connection means that the two words appear together more than 30,000 times. At the center we find the terms ‘time’, ‘talk’ and ‘guy’, which means that these words often appear alongside the others.
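The idea behind the co-occurrence counts can be sketched with a simple sliding window: two terms co-occur when they appear within the same window of text. This is an illustrative Python version of the principle (the report uses quanteda’s feature co-occurrence matrix in R), with an invented token sequence.

```python
from collections import Counter
from itertools import combinations  # imported for clarity; pairs built manually

# Window-based co-occurrence sketch: count unordered pairs of terms
# appearing within `window` positions of each other.
def cooccurrence(tokens, window=5):
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))
            counts[pair] += 1
    return counts

counts = cooccurrence(["time", "guy", "talk", "time", "guy"], window=3)
print(counts[("guy", "time")])
```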
As with the season data, we tokenized and pre-processed the data: we removed all numbers, punctuation, symbols, separators and stopwords, as well as the same vector containing the characters’ names and the vector of words we judged not insightful for our analysis. Then we conducted a lemmatisation.
We show here the most used words per character. We see that Leonard often uses the word ‘love’ and Penny often uses the word ‘fine’. A first idea could be that Leonard is a very positive person, if the word ‘love’ is often pronounced by him.
Next, we want an idea of which word is specific to which character, so we plot the 10 most specific terms per character. Interestingly, the terms are very similar, but their allocation to the characters differs slightly. From this plot, the word ‘remarkable’ seems quite specific to Howard, while the terms ‘lord’ and ‘beverage’ are quite specific to Leonard. Penny may have a verbal tic, ‘hee’, as it is very specific to her.
This plot confirms our first insight. As already seen in the season analysis, terms such as ‘time’, ‘guy’ and ‘talk’ are very frequent across all seasons, and here they appear for every character, so they are not specific to anyone. On the contrary, words such as ‘hee’ and ‘remarkable’ are very specific to a particular character.
We compute the lexical diversity of each character. Raj seems to have the most diverse vocabulary, with a type-token ratio a bit above 0.3; indeed, in the series he sometimes even speaks Hindi. Surprisingly, Sheldon has the least diverse vocabulary, with a TTR below 0.25; we expected more, since he seems to be the best-known character of the series.
For the EDA of the characters, we did not judge it relevant to perform a co-occurrence analysis, as the script is the same regardless of whether it is split per season or per character.
In this part, we want to compute the sentiment of each season. To do so, we first use the ‘NRC’ dictionary to perform the analysis. We then compare our results with another analysis using a different dictionary, ‘afinn’.
The NRC dictionary contains a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (positive and negative). In this dictionary, one English word can be associated with several emotions; for example, the following table shows that the term ‘abandon’ is associated with fear, negative and sadness.
For each token in our scripts, we join the corresponding sentiment qualifier from “nrc” using the inner_join() function from dplyr. Below, you can see the first 10 rows of the joined table.
| season | word | sentiment |
|---|---|---|
| 1 | bank | trust |
| 1 | agreed | positive |
| 1 | agreed | trust |
| 1 | excuse | negative |
| 1 | prince | positive |
| 1 | prince | positive |
| 1 | bank | trust |
| 1 | fill | trust |
| 1 | time | anticipation |
| 1 | wait | anticipation |
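The inner-join step can be sketched as follows. The report does this with dplyr’s inner_join() in R; below is only an illustrative Python version, with a tiny invented excerpt of the NRC lexicon.

```python
# Sketch of the inner-join step: keep only the tokens that appear in the
# sentiment lexicon, emitting one (token, sentiment) row per match.
# `nrc` here is a toy excerpt, not the real lexicon.
nrc = {"bank": ["trust"], "excuse": ["negative"],
       "time": ["anticipation"], "abandon": ["fear", "negative", "sadness"]}

def join_sentiments(tokens, lexicon=nrc):
    """Inner join: tokens without a lexicon entry are dropped."""
    return [(t, s) for t in tokens for s in lexicon.get(t, [])]

rows = join_sentiments(["bank", "photon", "time"])
print(rows)  # 'photon' is not in the lexicon, so it disappears
```

Note how the join both drops unmatched tokens (‘photon’) and duplicates tokens with several emotions, which is why ‘abandon’ would produce three rows.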
To compare the documents (seasons), we rescale them by their length, i.e. the frequencies of sentiments are computed per document:
By re-scaling, we see that all seasons follow the same pattern: they are mainly positive and reflect the sentiment of trust, while very few disgust words appear. This recurrent pattern can be understood from the consistency of the characters’ humour and mood; also, if the comedy works well, the writers will follow the same recipe season after season. The strong presence of trust may reflect the friendship aspect, which is indeed the main component of the show.
Now, in this part, we use the AFINN dictionary. This dictionary contains a list of English words manually rated for valence with an integer between -5 (very negative) and +5 (very positive) by Finn Årup Nielsen.
We see in the table below that the word ‘abandoned’ has a value of -2, which means that it is relatively negative.
| word | value |
|---|---|
| abandon | -2 |
| abandoned | -2 |
| abandons | -2 |
| abducted | -2 |
| abduction | -2 |
| abductions | -2 |
We see in the table below that only seasons 3 and 2 have a negative average score, with season 3 being the most negative; all the other seasons score positive.
| season | score |
|---|---|
| 3 | -0.04703390 |
| 2 | -0.01577287 |
| 7 | 0.02644320 |
| 4 | 0.10514457 |
| 1 | 0.11864407 |
| 10 | 0.19701493 |
| 5 | 0.23056402 |
| 8 | 0.30131827 |
| 6 | 0.31364829 |
| 9 | 0.34000728 |
From this analysis, we observe that the second half of the show, except for season 7, has a more positive score. We can suppose that the characters tend to become more positive, and maybe less sarcastic, as the series goes on.
Using another dictionary, ‘data_dictionary_LSD2015’, we reach much the same conclusions.
Valence shifters are words that alter or intensify the meaning of the polarized words and include negators and amplifiers.
| Valence Shifter | Value |
|---|---|
| Negator | 1 |
| Amplifier (intensifier) | 2 |
| De-amplifier (downtoner) | 3 |
| Adversative Conjunction | 4 |
With this density graph, we can see the average sentiment over the whole seasons, taking the valence shifters into account. On average it tends to be above 0, with a mean around 0.8.
The analysis can also be run sentence by sentence. Looking at season 9 for instance, the end of the show looks very positive, with an increasing trend. The pattern shows occasional high peaks, as in season 4, and some very low peaks, but it remains well distributed. Indeed, if we look at season 10, we have 4 drops into the negative, appearing roughly every 1000 sentences.
With this barplot, we can once again identify the scaled average sentiment and see that the most positive season is 10 and the least positive is 2. The main difference with the valence-shifter analysis is that season 10 comes out much more positive than in the previous one, which shows how important it is to take this aspect into account.
In this part, we want to compare the similarities of the scripts between seasons. We use three similarity measures to compute the similarity matrix: Jaccard similarity, cosine distance, and Euclidean distance. The most explanatory method in our case seems to be the Euclidean one, so we concentrate on it in what follows.
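The three measures can be sketched directly on term-frequency vectors (or, for Jaccard, on the sets of word types). The report computes them in R with quanteda; this is only an illustrative stdlib-Python version on toy inputs.

```python
import math

# Three similarity/distance measures for comparing season vectors.
def jaccard(a, b):
    """Set overlap of word types: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity of two frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def euclidean(u, v):
    """Euclidean distance of two frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

print(jaccard(["time", "guy"], ["time", "talk"]))  # 1 shared / 3 in union
print(euclidean([1.0, 0.0], [0.0, 1.0]))
```

Cosine and Jaccard are similarities (higher means closer), while Euclidean is a distance (lower means closer), which matters when reading the matrices.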
From the Euclidean co-occurrence matrix plot, it seems that season 4 is close to every other season. On the contrary, seasons 6 and 2 are more distant (but we know that there may be a problem in the web scraping).
Then, to create a cluster, we decide to focus on the Euclidean distance only.
## Clust.1 Clust.2 Clust.3
## 1 gablehouser siri sheldon-bot
## 2 hee switzerland latham
## 3 no-one bom glenn
## 4 leo flags todd
## 5 halo crawley troll
Not surprisingly here, we find the same characteristics for seasons 4, 2 and 6 as before regarding their proximities.
We use the cosine distance measure to determine the similarities between words.
We decided to represent the word similarities with a cluster dendrogram rather than a matrix, which is harder to read. The distance used for the clustering is 1 - similarity (cosine). As a result, ‘feel’ and ‘happy’ are really close, whereas ‘live’, ‘baby’, ‘night’ and ‘friend’ are very distant from one another. Indeed, if ‘baby’ refers to a child, it is presumably not used in the same scenes as ‘night’ and ‘friend’.
We want to analyze the topics of the season scripts using LSA and LDA.
As the first dimension is often linked to document length, we wanted to verify that this was the case here, and indeed, dimension 1 is negatively correlated with document length.
Then we did an analysis of topics 2 and 3.
In order to visually represent the relationship between topics 2 and 3, seasons and words, we produce an LSA-based biplot. Because of the large number of terms, interpretation is difficult; below, the chart is restricted to the terms most related to dimensions 2 and 3.
The biplot shows that Topic 2 is associated with seasons 7,8,9
and 10 and with the words “love”, “baby”, “happy”,“guy” and
anti-associated with season 3 and with the words “ring”, “friend”,
“mother”. Topic 3 is associated with seasons 1 and 4 and with words
“enter”, “gablehouser”, “machine” and anti-associated with “ring”,
“fine”, “day”.
We now turn to an LDA. We started with 10 topics and then eliminated the non-meaningful ones until we settled on 5 topics. These topics are related to the words below.
## topic1 topic2 topic3 topic4 topic5
## [1,] "page" "past" "play" "baby" "time"
## [2,] "sister" "enter" "wed" "feel" "guy"
## [3,] "sheldon-bot" "leave" "hawk" "birthday" "talk"
## [4,] "latham" "voice" "space" "flag" "friend"
## [5,] "girlfriend" "alright" "kripke" "kitchen" "call"
The “phi” gives the probability of selecting a term given a topic. For a given topic, the largest phi values identify the terms most associated with it.
Here we plot the 5 highest-probability terms within each topic. Although these terms have the highest probability of appearing in their topic, their phi values are relatively low, so they are only slightly more common than the other terms.
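What phi means can be made concrete: for each topic, the term weights form a probability distribution over the vocabulary, so they sum to one. The sketch below normalises invented topic-term counts into such a distribution; the real phi values of course come from the fitted LDA model, not from raw counts.

```python
# Sketch of "phi": per-topic term probabilities that sum to 1.
# `topic_term_counts` is hypothetical toy data.
topic_term_counts = {"topic5": {"time": 40, "guy": 35, "talk": 25}}

def phi(counts):
    """Normalise each topic's term counts into a probability distribution."""
    out = {}
    for topic, terms in counts.items():
        total = sum(terms.values())
        out[topic] = {t: c / total for t, c in terms.items()}
    return out

p = phi(topic_term_counts)
print(p["topic5"]["time"])  # 40 / 100 = 0.4
```

With a real vocabulary of thousands of terms, even the top terms get small probabilities, which is exactly the observation made above.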
The “theta” provides the probabilities (i.e., proportions) of the topics within each document (season).
This graph shows that Topic 5 is present at more than 50% in all seasons. Topic 1 is more related to season 4, Topic 2 to seasons 1, 2 and 3, Topic 3 to seasons 5, 6 and 8, and Topic 4 to seasons 9 and 10. There seems to be a chronological link between the topics and the seasons.
After the analysis of the script according to the seasons, we wanted to see how the five main characters (Sheldon, Leonard, Penny, Raj and Howard) of the show impact the script of the show through a sentiment analysis.
We use a data set in which each observation is one sentence spoken by a given character. Thus, we can again use the ‘nrc’ lexicon and get an idea of the dispersion of feelings for each of our characters.
First, we see in this graph that Sheldon seems to be the most ‘intense’ character, in the sense that he is the one who uses the most words that can be categorized by a feeling. We then notice an identical pattern across all characters: a prevalence of positive words, then negative ones, and conversely fewer words related to the feelings ‘trust’ and ‘disgust’.
Since negative and positive feelings predominate in the previous analysis, we wanted to try another dictionary: a general-purpose English sentiment lexicon that categorizes words in a binary fashion, either positive or negative.
We obtain a surprising result considering our previous findings. Indeed, we notice that the ‘negative’ category represents a major part for all the characters, which contradicts the results of the nrc lexicon (why?). We also notice that Sheldon is the most negative character and Penny the most ‘positive’ one, which is consistent with our previous results. We can also imagine that some seasons are more or less pleasant for our characters: for example, Raj seems to use more ‘positive’ words in season 9 and Leonard in season 2, while Sheldon uses more negative than positive words in seasons 1, 3 and 7.
The analysis can also be run sentence by sentence. The most ‘intense’ character is Sheldon: he appears to be very expressive both in the positive and in the negative. However, we can guess that he tends to be less and less negative towards the end of the show, as the lowest peaks become a little less frequent.
With this barplot, we can once again identify the average
sentiment and see that the most positive character is Penny with an
average of 0.13, followed by Leonard. The least positive character is
Sheldon.
First, we create the corpus from the data set “character_speech”. In this data set, every line of the script is coupled to the character who speaks it (Howard, Leonard, Penny, Raj, or Sheldon). The variable y is the character name and is the variable we want to predict, based on the lines of the script. After this, we create a DFM from the tokenized corpus of the characters and their corresponding speech.
Next, we train the classifier. First we combine the target variable and LSA together in a data frame. We then take a sample of 80% of this data frame as the train set. The other 20% will be used as the test set. We then train the classifier with the ranger package. We then predict and show the results in a confusion matrix of the caret package. It can be noted that the base rate (here called “No Information Rate”) is 0.2946. With an accuracy of 0.3572, it can be concluded that it does better than it would by random sampling. However, it also can be said that this accuracy is quite low. Therefore, in the next couple of paragraphs we look at further improving the model and its accuracy.
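The evaluation logic above (80/20 split, and the “No Information Rate” as the majority-class baseline) can be sketched in a few lines. This is a hypothetical Python illustration with invented toy features; the report itself trains a random forest with R’s ranger on LSA features and evaluates it with caret.

```python
from collections import defaultdict

# Toy data: (feature vector, character label) pairs — purely illustrative.
data = [([0.1, 0.2], "Sheldon"), ([0.4, 0.1], "Leonard"),
        ([0.3, 0.3], "Sheldon"), ([0.2, 0.5], "Penny"),
        ([0.6, 0.2], "Sheldon")]

cut = int(0.8 * len(data))              # 80% train, 20% test
train, test = data[:cut], data[cut:]

# The "No Information Rate" is the accuracy of always predicting the
# most frequent class in the data.
counts = defaultdict(int)
for _, label in train:
    counts[label] += 1
majority = max(counts, key=counts.get)  # most frequent training class
baseline = sum(label == majority for _, label in test) / len(test)
print(majority, baseline)
```

A trained model is only useful to the extent that its accuracy exceeds this baseline, which is why the report compares 0.3572 against a base rate of 0.2946.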
First, we transform the DFM to LSA, as we did in the previous paragraph. However, now we try to see for which number of dimensions the model gives the highest accuracy. A maximum number of 1000 dimensions is chosen, as with these dimensions the run time is already very long and the accuracy does not seem to increase significantly after 1000 dimensions.
The accuracies for 2, 5, 25, 50, 100, 500 and 1000 dimensions are respectively 0.2597865, 0.3128437, 0.3303138, 0.3377548, 0.3513426, 0.3581365, and 0.3568424. Due to the long run time and the flattening accuracy curve, we choose 100 dimensions (nd = 100) for the DFM and LSA, as this gives a relatively high accuracy while keeping the run time in check.
Second, we now try to further improve the model by first transforming the DFM into a TF-IDF. As in the paragraph above, we again determine for which number of dimensions the model gives the best accuracy. The resulting accuracies are 0.2761242, 0.3023293, 0.3485927, 0.3489162, 0.3558719, 0.3587836, and 0.3536072 for 2, 5, 25, 50, 100, 500 and 1000 dimensions respectively.
After running several scenarios with different dimensions, we again choose to use 100 dimensions, as more dimensions will increase the run time immensely, while the improvement on the accuracy is minimal. Furthermore, we choose to use the tf-idf, as this outperforms the dfm by a small margin, namely 0.3513426 for the dfm and 0.3558719 for the tf-idf.
We now rerun the model with the chosen dimensions, so we can further improve on the accuracy with word embedding.
We choose 100 iterations: the loss does keep decreasing with more iterations, but the run time becomes too extensive compared to the improvement from each additional iteration, so we make the arbitrary decision to stop at 100. Increasing the rank from 25 to 50 (with 100 iterations) decreases the loss from 0.0202 to 0.0051 and increases the accuracy from 0.3569 to 0.3692. Using a rank of 100 gives a loss of zero and an accuracy of 0.3614. It is thus interesting that for a rank of 100 the accuracy decreases even though the loss decreases. As a rank of 50 returns the highest accuracy among the values tried, we use it in our further analysis.
For a window of 1, we get a loss of 0.0052 and an accuracy of 0.3692 with a rank of 50. As the paragraph above has shown, a lower loss does not imply a higher accuracy, so we compute both the loss and the accuracy for each window. Increasing the window to 5 gives a loss of 0.0354 and an accuracy of 0.365. Decreasing the window to 3 gives a loss of 0.0253 and an accuracy of 0.3687, and a window of 2 gives a loss of 0.0168 and an accuracy of 0.3685. Again, the differences in accuracy are small, but as a window of 1 results in the highest accuracy, we chose it as our base for the further improvements with GloVe.
We also tried adding the length of the sentences, but it decreases the accuracy to 0.3596. The explanation might be that each character has a similar average line length, as it is a very large corpus we are working with.
Also, it seems that adding the centers will decrease the accuracy, namely to 0.3621. Thus, we get no further improvement based on combining the centers from the GloVe model and the tf-idf, compared to the GloVe model by itself.
Out of curiosity and interest, we also tried to see how this accuracy would materialize in practice. The idea was to come up with a random sentence and see how the model would classify it. Sadly due to time constraint, we were not able to figure out the code to successfully predict a character name based on a random sentence. The code we tried is accessible within our files.
Concluding from the previously run code, using the GloVe model by itself gives the highest accuracy. Two things should be noted, however: the accuracy does not change much between the different methods, and although it is higher than the base rate, the improvement is not large. An explanation might be that these are fictional characters, who differ in the show by personality but not much by vocabulary, as seen in the Exploratory Data Analysis with the type-token-ratio graph. In other words, as most scenes in the series are conversations between the characters, similar topics and words are discussed by all of them, so the same kind of distinctive words is used by multiple characters. Also, since Penny is perceived as the less intellectual character in the series, we would have expected the models to predict Penny rather well; however, the confusion matrices show that this is not the case.
Given the context of this series, we could have expected a much more scientific vocabulary. But we realized that it fits squarely into the codes of American sitcoms, which put forward the relationships between the characters, the dramas, and the typical subjects we can expect from people of their age.
Concerning the characters, Sheldon is the star of the show and this is felt in the analysis. He is always the one who talks the most and is the most intense.
The second half of the show, except for season 7, has a more positive score. We can suppose that the characters tend to become more positive and maybe less sarcastic as the series goes on.
Some limits we encountered could be:
As future work, we might want to improve the machine learning part. Indeed, as already explained, the accuracy is not very high, so we could try new techniques to improve it, or even try other machine learning models. Another avenue would be to continue the part we started on predicting a character based on a given sentence.